As advances in technology allow for the collection, storage, and analysis of vast amounts of data, 
the task of screening and assessing the significance of discovered patterns is becoming a major challenge 
in data mining applications. In this work, we address significance in the context of frequent itemset 
mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for 
a dataset, such that the number of itemsets with support at least s* represents a substantial deviation 
from what would be expected in a random dataset with the same number of transactions and the same 
individual item frequencies. These itemsets can then be flagged as statistically significant with a small 
false discovery rate. We present extensive experimental results to substantiate the effectiveness of our 
methodology. 

1 Introduction 

The discovery of frequent itemsets in transactional datasets is a fundamental primitive that arises in the 
mining of association rules and in many other scenarios [15, 25]. In its original formulation, the problem 
requires that, given a dataset $\mathcal{D}$ of transactions over a set of items $\mathcal{I}$ and a support threshold $s$, all itemsets 
$X \subseteq \mathcal{I}$ with support at least $s$ in $\mathcal{D}$ (i.e., contained in at least $s$ transactions) be returned. These high-support 
itemsets are referred to as frequent itemsets. 
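
The definitions of support and of frequent itemsets can be illustrated with a minimal brute-force sketch. The function names `support` and `frequent_itemsets` and the toy dataset are assumptions made for illustration; they are not taken from the paper, and practical miners use far more efficient strategies.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def frequent_itemsets(transactions, k, s):
    """All k-itemsets with support at least s (brute force, illustration only)."""
    counts = {}
    for t in transactions:
        # Enumerate every k-subset of the transaction and count it once.
        for x in combinations(sorted(set(t)), k):
            counts[x] = counts.get(x, 0) + 1
    return {x: c for x, c in counts.items() if c >= s}

# Hypothetical toy dataset of five transactions over the items a, b, c.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
```

With this toy dataset, every pair of items has support 3, so all three pairs are frequent for s = 3 and none is frequent for s = 4.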

Since the pioneering paper by Agrawal et al. [2], a vast literature has flourished, addressing variants of the 
problem, studying foundational issues, and presenting novel algorithmic strategies or clever implementations 
of known strategies (see, e.g., [HJH2]), but many problems remain open [TJ]. In particular, assessing the 
significance of the discovered itemsets, or equivalently, flagging statistically significant discoveries with a 
limited number of false positive outcomes, is still poorly understood and remains one of the most challenging 
problems in this area. 

The classical framework requires that the user decide what is significant by specifying the support threshold 
$s$. Unless specific domain knowledge is available, the choice of such a threshold is often arbitrary [15, 25] 
and may lead to a large number of spurious discoveries (false positives) that would undermine the success 
of subsequent analysis. 

In this paper, we develop a rigorous and efficient novel approach for identifying frequent itemsets featuring 
both a global and a pointwise guarantee on their statistical significance. Specifically, we flag as significant 
a population of itemsets extracted with respect to a certain threshold, if some global characteristics of the 
population deviate considerably from what would be expected if the dataset were generated randomly with 
no correlations between items. Also, we make sure that a large fraction of the itemsets belonging to the 
returned population are individually significant by enforcing a small False Discovery Rate (FDR) [4] for the 
population. 

1.1 The model 

As mentioned above, the significance of a discovery in our framework is assessed based on its deviation 
from what would be expected in a random dataset in which individual items are placed in transactions 
independently. 

Formally, let $\mathcal{D}$ be a dataset of $t$ transactions on a set $\mathcal{I}$ of $n$ items, where each transaction is a subset of 
$\mathcal{I}$. Let $n(i)$ be the number of transactions that contain item $i$ and let $f_i = n(i)/t$ be the frequency of item $i$ 
in the dataset. The support of an itemset $X \subseteq \mathcal{I}$ is defined as the number of transactions that contain $X$. 
Following [23], the dataset $\mathcal{D}$ is associated with a probability space of datasets, all featuring the same number 
of transactions $t$ on the same set of items $\mathcal{I}$ as $\mathcal{D}$, and in which item $i$ is included in any given transaction 
with probability $f_i$, independently of all other items and all other transactions. A similar model is used 
in [20] and [21] to evaluate the running time of algorithms for frequent itemset mining. For a fixed integer 
$k > 1$, among all possible $\binom{n}{k}$ itemsets of size $k$ ($k$-itemsets) we are interested in identifying statistically 
significant ones, that is, those $k$-itemsets whose supports are significantly higher, in a statistical sense, than 
their expected supports in a dataset drawn at random from the aforementioned probability space. 
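
As a rough illustration of this random model, the following sketch estimates the item frequencies of a given dataset and then draws a random dataset with the same number of transactions in which each item is placed independently. The names `item_frequencies` and `random_dataset` are hypothetical helpers introduced here, not part of the paper.

```python
import random

def item_frequencies(transactions):
    """f_i = n(i) / t for every item i occurring in the dataset."""
    t = len(transactions)
    counts = {}
    for tr in transactions:
        for i in set(tr):
            counts[i] = counts.get(i, 0) + 1
    return {i: c / t for i, c in counts.items()}

def random_dataset(freqs, t, rng):
    """Draw t transactions; item i joins each one independently with probability f_i."""
    return [{i for i, f in freqs.items() if rng.random() < f} for _ in range(t)]
```

A typical use would be `random_dataset(item_frequencies(data), len(data), random.Random(0))`, which preserves the number of transactions and, in expectation, the individual item frequencies, but not the transaction lengths.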

An alternative probability space of datasets, proposed in [10], considers all arrangements of $n$ items into 
$t$ transactions which match the exact item frequencies and transaction lengths of $\mathcal{D}$. Conceivably, the 
technique of this paper could be adapted to this latter model as well. 

1.2 Multi-hypothesis testing 

In a simple statistical test, a null hypothesis $H_0$ is tested against an alternative hypothesis $H_1$. A test 
consists of a rejection (critical) region $C$ such that, if the statistic (outcome) of the experiment is in $C$, then 
the null hypothesis is rejected, and otherwise the null hypothesis is not rejected. The significance level of a 
test, $\alpha = \Pr(\text{Type I error})$, is the probability of rejecting $H_0$ when it is true (false positive). The power of 
the test, $1 - \Pr(\text{Type II error})$, is the probability of correctly rejecting the null hypothesis. The p-value of 
a test is the probability of obtaining an outcome at least as extreme as the one that was actually observed, 
under the assumption that $H_0$ is true. 

In a multi-hypothesis statistical test, the outcome of an experiment is used to test simultaneously a 
number of hypotheses. For example, in the context of frequent itemsets, if we seek significant $k$-itemsets, we 
are in principle testing $\binom{n}{k}$ null hypotheses simultaneously, where each null hypothesis corresponds to the 
support of a given itemset not being statistically significant. In the context of multi-hypothesis testing, the 
significance level cannot be assessed by considering each individual hypothesis in isolation. To demonstrate 
the importance of correcting for multiplicity of hypotheses, consider a simple real dataset of 1,000,000 
transactions over 1,000 items, each with frequency 1/1,000. Assume that we observed that a pair of items 
appears in at least 7 transactions. Is the support of this pair statistically significant? To evaluate the 
significance of this discovery we consider a random dataset where each item is included in each transaction 
with probability 1/1,000, independently of all other items. The probability that the pair is included in a given 
transaction is 1/1,000,000, thus the expected number of transactions that include this pair is 1. A simple 
calculation shows that the probability that the pair appears in at least 7 transactions is about 0.0001. Thus, it 
seems that the support of the pair in the real dataset is statistically significant. However, each of the 499,500 
pairs of items has probability 0.0001 of appearing in at least 7 transactions in the random dataset. Thus, even 
under the assumption that items are placed independently in transactions, the expected number of pairs 
with support at least 7 is about 50. If there were only about 50 pairs with support at least 7, returning the 
pair as a statistically significant itemset would likely be a false discovery, since its frequency would be 
better explained by random fluctuations in observed data. On the other hand, assume that the real dataset 
contains 300 disjoint pairs each with support at least 7. By the Chernoff bound [18], the probability of that 
event in the random dataset is less than $2^{-300}$. Thus, it is very likely that the support of most of these pairs 
would be statistically significant. A discovery process that does not return these pairs will result in a large 
number of false negative errors. Our goal is to design a rigorous methodology which is able to distinguish 
between these two scenarios. 
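
The figures in this example can be checked numerically. The sketch below, which uses only the Python standard library, computes the exact binomial tail for a single pair and the resulting expected number of pairs with support at least 7 under the random model. The helper name `binomial_upper_tail` is an assumption of this sketch. The values it produces are of the same order as the rounded figures quoted in the text (a tail probability close to $10^{-4}$ and a few tens of pairs).

```python
from math import comb

def binomial_upper_tail(t, p, s):
    """Pr(Bin(t, p) >= s), computed as one minus the lower tail (adequate for small s)."""
    lower = sum(comb(t, i) * p**i * (1.0 - p) ** (t - i) for i in range(s))
    return 1.0 - lower

n_items, t = 1_000, 1_000_000
p_pair = (1 / 1_000) ** 2          # a fixed pair falls in a given transaction
tail = binomial_upper_tail(t, p_pair, 7)
n_pairs = comb(n_items, 2)         # number of distinct pairs of items
expected_pairs = n_pairs * tail    # expected number of pairs with support >= 7

print(f"Pr(support >= 7) for one pair: {tail:.2e}")
print(f"number of pairs: {n_pairs}, expected pairs with support >= 7: {expected_pairs:.1f}")
```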

A natural generalization of the significance level to multi-hypothesis testing is the Family Wise Error 
Rate (FWER), which is the probability of incurring at least one Type I error in any of the individual tests. If 
we have $m$ simultaneous tests and we want to bound the FWER by $\alpha$, then the Bonferroni method tests each 
null hypothesis with significance level $\alpha/m$. While controlling the FWER, this method is too conservative 
in that the power of the test is too low, giving many false negatives. There are a number of techniques that 
improve on the Bonferroni method, but for large numbers of hypotheses all of these techniques lead to tests 
with low power (see [7] for a good review). 

The False Discovery Rate (FDR) was suggested by Benjamini and Hochberg [3] as an alternative, less 
conservative approach to control errors in multiple tests. Let $V$ be the number of Type I errors in the 
individual tests, and let $R$ be the total number of null hypotheses rejected by the multiple test. Then we 
define FDR to be the expected ratio of erroneous rejections among all rejections, namely FDR $= \mathrm{E}[V/R]$, 
with $V/R = 0$ when $R = 0$. Designing a statistical test that controls for FDR is not simple, since the FDR is 
a function of two random variables that depend both on the set of null hypotheses and the set of alternative 
hypotheses. Building on the work of [3], Benjamini and Yekutieli [5] developed a general technique for 
controlling the FDR in any multi-hypothesis test (see Theorem 5 in Section 3.1). 

1.3 Our Results 

We address the classical problem of mining frequent itemsets with respect to a certain minimum support 
threshold, and provide a rigorous methodology to establish a threshold that guarantees, in a statistical sense, 
that the returned family of frequent itemsets contains significant ones with a limited FDR. Our methodology 
crucially relies on the following Poisson approximation result, which is the main theoretical contribution of 
the paper. 

Consider a dataset $\mathcal{D}$ of $t$ transactions on a set $\mathcal{I}$ of $n$ items and let $\hat{\mathcal{D}}$ be a corresponding random dataset 
according to the random model described in Section 1.1. Let $Q_{k,s}$ be the observed number of $k$-itemsets 
with support at least $s$ in $\mathcal{D}$, and let $\hat{Q}_{k,s}$ be the corresponding random variable for $\hat{\mathcal{D}}$. We show that there 
exists a minimum support value $s_{\min}$ (which depends on the parameters of $\hat{\mathcal{D}}$ and on $k$), such that for all 
$s \ge s_{\min}$ the distribution of $\hat{Q}_{k,s}$ is well approximated by a Poisson distribution. Our result is based on a 
novel application of the Chen-Stein Poisson approximation method [3]. 

The minimum support $s_{\min}$ provides the grounds to devise a rigorous method for establishing a support 
threshold for mining significant itemsets, both reducing the overall complexity and improving the accuracy 
of the discovery process. Specifically, for a fixed itemset size $k$, we test a small number of support thresholds 
$s \ge s_{\min}$, and, for each such threshold, we measure the p-value corresponding to the null hypothesis $H_0$ 
that the observed value $Q_{k,s}$ comes from a Poisson distribution of suitable expectation. From the tests we 
can determine a threshold $s^*$ such that, with user-defined significance level $\alpha$, the number of $k$-itemsets 
with support at least $s^*$ is not sampled from a Poisson distribution and is therefore statistically significant. 
Observe that the statistical significance of the number of itemsets with support at least $s^*$ does not 
necessarily imply that each of the itemsets is significant. However, our test is also able to guarantee a user-defined 
upper bound $\beta$ on the FDR among all discoveries. We remark that our approach works for any fixed 
itemset size $k$, unlike traditional frequent itemset mining, where itemsets of all sizes are extracted for a given 
threshold. 
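
To make the test on a single threshold concrete, the following sketch computes the p-value of an observed count under a Poisson null hypothesis. It assumes that the Poisson expectation `lam` is already known or has been estimated; how the paper obtains that expectation is not covered here. The names `poisson_upper_tail` and `rejects_poisson_null` are hypothetical.

```python
from math import exp

def poisson_upper_tail(lam, q):
    """Pr(Poisson(lam) >= q), obtained by summing the lower tail term by term."""
    if q <= 0:
        return 1.0
    term = exp(-lam)          # Pr(U = 0); underflows for very large lam
    cdf = term
    for j in range(1, q):
        term *= lam / j       # Pr(U = j) from Pr(U = j - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def rejects_poisson_null(q_observed, lam, alpha):
    """True if the observed count is too large to be explained by Poisson(lam)."""
    return poisson_upper_tail(lam, q_observed) <= alpha
```

For example, with an expectation of 1 itemset above the threshold, observing 7 such itemsets gives a p-value below $10^{-4}$, so the null hypothesis would be rejected at significance level 0.05.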

To grasp the intuition behind the above approach, recall that a Poisson distribution models the number of 
occurrences among a large set of possible events, where the probability of each event is small. In the context 
of frequent itemset mining, the Poisson approximation holds when the probability that an individual itemset 
has support at least $s_{\min}$ in $\hat{\mathcal{D}}$ is small, and thus the existence of such an event in $\mathcal{D}$ is likely to be statistically 
significant. We stress that our technique discovers statistically significant itemsets among those of relatively 
high support. In fact, if the expected supports of individual itemsets vary in a large range, there may 
exist itemsets with very low expected supports in $\hat{\mathcal{D}}$ which may have statistically significant supports in $\mathcal{D}$. 
These itemsets would not be discovered by our strategy. However, any mining strategy aiming at discovering 
significant, low-support itemsets is likely to incur high costs due to the large (possibly exponential) number 
of candidates to be examined, although only a few of them would turn out to be significant. 

We validate our theoretical results by mining significant frequent itemsets from a number of real datasets 
that are standard benchmarks in this field. Also, we compare the performance of our methodology to a 
standard multi-hypothesis approach based on [5], and provide evidence that the latter often returns fewer 
significant itemsets, which indicates that our method has considerably higher power. 

1.4 Related Work 

A number of works have explored various notions of significant itemsets and have proposed methods for their 
discovery. Below, we review those most relevant to our approach and refer the reader to [14, Section 3] for 
further references. Aggarwal and Yu [1] relate the significance of an itemset $X$ to the quantity 
$((1 - v(X))/(1 - \mathrm{E}[v(X)])) \cdot (\mathrm{E}[v(X)]/v(X))$, where $v(X)$ represents the fraction of transactions containing some but not all 
of the items of $X$, and $\mathrm{E}[v(X)]$ represents the expectation of $v(X)$ in a random dataset where items occur in 
transactions independently. This ratio provides an empirical measure of the correlation among the items of 
$X$ that, according to the authors, is more effective than absolute support. In [8, 9, 24], the significance of an 
itemset is measured as the ratio $R$ between its actual support and its expected support in a random dataset. 
In order to make this measure more accurate for small supports, [8, 9] propose smoothing the ratio $R$ using 
an empirical Bayesian approach. Bayesian analysis is also employed in [22] to derive subjective measures of 
significance of patterns (e.g., itemsets) based on how strongly they "shake" a system of established beliefs. 
In [16], the significance of an itemset is defined as the absolute difference between the support of the itemset 
in the dataset, and the estimate of this support made from a Bayesian network with parameters derived 
from the dataset. 

A statistical approach for identifying significant itemsets is presented in [23], where the measure of interest 
for an itemset is defined as the degree of dependence among its constituent items, which is assessed through 
a $\chi^2$ test. Unfortunately, as reported in [8, 9], there are technical flaws in the applications of the statistical 
test in [23]. Nevertheless, this work pioneered the quest for a rigorous framework for addressing the discovery 
of significant itemsets. 

A common drawback of the aforementioned works is that they assess the significance of each itemset 
in isolation, rather than taking into account the global characteristics of the dataset from which they are 
extracted. As argued before, if the number of itemsets considered by the analysis is large, even in a purely 
random dataset some of them are likely to be flagged as significant if considered in isolation. A few works 
attempt to account for the global structure of the dataset in the context of frequent itemset mining. The 
authors of [10] propose an approach based on Markov chains to generate a random dataset that has identical 
transaction lengths and identical frequencies of the individual items as the given real dataset. The work 
suggests comparing the outcomes of a number of data mining tasks, frequent itemset mining among the 
others, in the real and the randomly generated datasets in order to assess whether the real datasets embody 
any significant global structure. However, such an assessment is carried out in a purely qualitative fashion 
without rigorous statistical grounding. 

Dataset    n        [f_min ; f_max]      m       t
Retail     16470    [1.13e-05 ; 0.57]    10.3    88162
Kosarak    41270    [1.01e-06 ; 0.61]    8.1     990002
Bms1       497      [1.68e-05 ; 0.06]    2.5     59602
Bms2       3340     [1.29e-05 ; 0.05]    5.6     77512
Bmspos     1657     [1.94e-06 ; 0.60]    7.5     515597
Pumsb*     2088     [2.04e-05 ; 0.79]    50.5    49046

Table 1: Parameters of the benchmark datasets: $n$ is the number of items; $[f_{\min} ; f_{\max}]$ is the range of 
frequencies of the individual items; $m$ is the average transaction length; and $t$ is the number of transactions. 

The problem of spurious discoveries in the mining of significant patterns is studied in [BJ. The paper 
is concerned with the discovery of significant pairs of items, where significance is measured through the 
p-value, that is, the probability of occurrence of the observed support in a random dataset. Significant pairs 
are those whose p-values are below a certain threshold that can be suitably chosen to bound the FWER, or 
to bound the FDR. The authors compare the relative power of the two metrics through experimental results, 
but do not provide methods to set a meaningful support threshold, which is the most prominent feature of 
our approach. 

Beyond frequent itemset mining, the issue of significance has also been addressed in the realm of discovering 
association rules. In [13], the authors provide a variation of the well-known Apriori strategy for 
the efficient discovery of a subset $A$ of association rules with p-value below a given cutoff value, while the 
results in [17] provide the means of evaluating the FDR in $A$. The FDR metric is also employed in [27] for 
the discovery of significant quantitative rules, a variation of association rules. None of these works is able to 
establish support thresholds such that the returned discoveries feature small FDR. 



1.5 Benchmark datasets 

In order to validate the methodology, a number of experiments, whose results are reported in Section 4, 
have been performed on datasets which are standard benchmarks in the context of frequent itemset mining. 
The main characteristics of the datasets we use are summarized in Table 1. A description of the datasets 
can be found in the FIMI Repository (http://fimi.cs.helsinki.fi/data/), where they are available for 
download. 



1.6 Organization of the Paper 

The rest of the paper is structured as follows. Section 2 presents the Poisson approximation result for the 
random variable $\hat{Q}_{k,s}$. The methodology for establishing the support threshold $s^*$ is presented in Section 3, 
and experimental results are reported in Section 4. Section 5 ends the paper with some concluding remarks. 

2 Poisson Approximation Result 



The Chen-Stein method [3] is a powerful tool for bounding the error in approximating probabilities associated 
with a sequence of dependent events by a Poisson distribution. To apply the method to our case, we fix 
parameters $k$ and $s$, and define a collection of $\binom{n}{k}$ Bernoulli random variables $\{Z_X \mid X \subset \mathcal{I}, |X| = k\}$, such 
that $Z_X = 1$ if the $k$-itemset $X$ appears in at least $s$ transactions in the random dataset $\hat{\mathcal{D}}$, and $Z_X = 0$ 
otherwise. Also, let $p_X = \Pr(Z_X = 1)$. We are interested in the distribution of $\hat{Q}_{k,s} = \sum_{X : |X| = k} Z_X$. 
For each set $X$ we define the neighborhood set of $X$, 

$$I(X) = \{X' : X \cap X' \ne \emptyset, \ |X'| = |X|\}.$$

If $Y \notin I(X)$ then $Z_Y$ and $Z_X$ are independent. The following theorem is a straightforward adaptation 
of [21 Theorem 1] to our case. 



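Theorem 1. Let $U$ be a Poisson random variable such that $\mathrm{E}[U] = \mathrm{E}[\hat{Q}_{k,s}] = \lambda < \infty$. The variation 
distance between the distributions $\mathcal{L}(\hat{Q}_{k,s})$ of $\hat{Q}_{k,s}$ and $\mathcal{L}(U)$ of $U$ is such that 

$$\|\mathcal{L}(\hat{Q}_{k,s}) - \mathcal{L}(U)\| = \sup_{A} |\Pr(\hat{Q}_{k,s} \in A) - \Pr(U \in A)| \le b_1 + b_2,$$

where 

$$b_1 = \sum_{X : |X| = k} \ \sum_{Y \in I(X)} p_X p_Y$$

and 

$$b_2 = \sum_{X : |X| = k} \ \sum_{X \ne Y \in I(X)} \mathrm{E}[Z_X Z_Y].$$
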

We can derive analytic bounds for $b_1$ and $b_2$ in many situations. Specifically, suppose that we generate $t$ 
transactions in the following way. For each item $x$, we sample a random variable $R_x \in [0, 1]$ independently 
from some distribution $R$. Conditioned on the $R_x$'s, each item $x$ occurs independently in each transaction 
with probability $R_x$. In what follows, we provide specific bounds for this situation that depend on the 
moment $\mathrm{E}[R^{2s}]$ of the random variable $R$. 

As a warm-up, we first consider the specific case where each $R_x$ is a fixed value $p = \gamma/n$ for some 
constant $\gamma$ for all $x$. That is, each item appears in each transaction with a fixed probability $p$, and the 
expected number of items per transaction is constant. The more general case follows the same approach, 
albeit with a few more technical difficulties. 

Theorem 2. Consider an asymptotic regime where as $n \to \infty$, we have $k, s = O(1)$ with $s \ge 2$, each item 
appears in each transaction with probability $p = \gamma/n$ for some constant $\gamma$, and $t = O(n^c)$ for some positive 
constant $0 < c < (k - 1)(1 - 1/s)$. Let $U$ be a Poisson random variable such that $\mathrm{E}[U] = \mathrm{E}[\hat{Q}_{k,s}] = \lambda < \infty$. 
Then the variation distance between the distributions $\mathcal{L}(\hat{Q}_{k,s})$ of $\hat{Q}_{k,s}$ and $\mathcal{L}(U)$ of $U$ satisfies 

$$\|\mathcal{L}(\hat{Q}_{k,s}) - \mathcal{L}(U)\| = O(1/n^{2s-2}).$$

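Proof. For a given set $X$ of $k$ items, let $p_{X,i}$ be the probability that $X$ appears in exactly $i$ transactions, so 
that $p_X = \sum_{i=s}^{t} p_{X,i}$ and 

$$p_{X,i} = \binom{t}{i} \left(\frac{\gamma}{n}\right)^{ki} \left(1 - \left(\frac{\gamma}{n}\right)^{k}\right)^{t-i}.$$

Applying Theorem 1 gives 

$$\|\mathcal{L}(\hat{Q}_{k,s}) - \mathcal{L}(U)\| \le b_1 + b_2,$$

where 

$$b_1 = \sum_{X : |X| = k} \ \sum_{Y \in I(X)} p_X p_Y$$

and 

$$b_2 = \sum_{X : |X| = k} \ \sum_{X \ne Y \in I(X)} \mathrm{E}[Z_X Z_Y].$$

We now evaluate $b_1$ and $b_2$. A direct calculation easily gives the value for $b_1$ given in the statement of 
the theorem. For the asymptotic analysis, we write 

$$\binom{n}{k}^2 - \binom{n}{k}\binom{n-k}{k} = \binom{n}{k}^2 \left(1 - \binom{n-k}{k} \Big/ \binom{n}{k}\right) = \Theta(n^k)^2 \cdot \Theta(1/n) = \Theta(n^{2k-1})$$

and 

$$p_{X,s} = \binom{t}{s} \left(\frac{\gamma}{n}\right)^{ks} \left(1 - \left(\frac{\gamma}{n}\right)^{k}\right)^{t-s} = \Theta(t^s) \cdot \Theta(n^{-ks}) \cdot (1 + o(1)) = \Theta(t^s n^{-ks}),$$

where we have used the fact that $t = o(n^k)$ to obtain the asymptotics for the third term. Also, we note that 
for any $1 \le i < t$ 

$$\frac{p_{X,i+1}}{p_{X,i}} = \frac{t-i}{i+1} \left(\frac{\gamma}{n}\right)^{k} \left(1 - \left(\frac{\gamma}{n}\right)^{k}\right)^{-1},$$

and so 

$$\max_{i \in \{s, s+1, \ldots, t-1\}} \frac{p_{X,i+1}}{p_{X,i}} = O(t n^{-k}) = O(1/n).$$

Using a geometric series, it follows that 

$$p_X = \sum_{i=s}^{t} p_{X,i} = p_{X,s}(1 + o(1)) = \Theta(t^s n^{-ks}).$$

Thus, we obtain 

$$b_1 = \Theta(n^{2k-1}) \cdot \Theta(t^s n^{-ks})^2 = \Theta(t^{2s} n^{2k(1-s)-1}) = \Theta(n^{2cs + 2k(1-s) - 1}).$$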

We now turn our attention to $b_2$. Consider sets $X \ne Y$ of $k$ items, let $g = |X \cap Y|$, and suppose 
that $g > 0$. Then if $Z_X Z_Y = 1$, there exist disjoint subsets $A, B, C \subseteq \{1, \ldots, t\}$ such that $0 \le |A| \le s$, 
$|B| = |C| = s - |A|$, all of the transactions in $A$ contain both $X$ and $Y$, all of the transactions in $B$ contain 
$X$, and all of the transactions in $C$ contain $Y$. 

Therefore, 

$$\mathrm{E}[Z_X Z_Y] \le \sum_{i=0}^{s} \binom{t}{i;\, s-i;\, s-i} \left(\frac{\gamma}{n}\right)^{(2k-g)i + 2k(s-i)},$$

where the notation $\binom{m}{x;\, y;\, z}$ is a shorthand for $\binom{m}{x}\binom{m-x}{y}\binom{m-x-y}{z}$. 
It follows that 

$$
\begin{aligned}
b_2 &\le \sum_{g=1}^{k-1} \binom{n}{g;\, k-g;\, k-g} \sum_{i=0}^{s} \binom{t}{i;\, s-i;\, s-i} \left(\frac{\gamma}{n}\right)^{(2k-g)i + 2k(s-i)} \\
&= \sum_{g=1}^{k-1} \binom{n}{g;\, k-g;\, k-g} \left(\frac{\gamma}{n}\right)^{2ks} \sum_{i=0}^{s} \binom{t}{i;\, s-i;\, s-i} \left(\frac{n}{\gamma}\right)^{gi} \\
&= \sum_{g=1}^{k-1} \Theta(n^{2k-g}) \, \Theta(n^{-2ks}) \sum_{i=0}^{s} \Theta(n^{c(2s-i)}) \, \Theta(n^{gi}) \\
&= \Theta(n^{2k(1-s)+2cs}) \sum_{g=1}^{k-1} n^{-g} \sum_{i=0}^{s} \Theta(n^{(g-c)i}) \\
&= \Theta(n^{2k(1-s)+2cs}) \sum_{g=1}^{k-1} n^{-g} \cdot
\begin{cases}
\Theta(1) & g \le c \\
\Theta(n^{(g-c)s}) & g > c
\end{cases} \\
&= \Theta(n^{2k(1-s)+2cs}) \cdot \Theta(n^{-(k-1)+(k-1-c)s}) \\
&= \Theta(n^{2k(1-s)+s(k-1+c)-k+1}).
\end{aligned}
$$

Note that, in the summation where there are two cases depending on whether $g \le c$ or $g > c$, we have 
used the assumption that $c < (k - 1)(1 - 1/s)$ to ensure the next equality. Finally, it is simple to check that 
both $b_1$ and $b_2$ are $O(1/n^{2s-2})$ if $c < (k - 1)(1 - 1/s)$. □ 
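
For the warm-up case, $b_1$ can also be evaluated numerically rather than asymptotically, since every $k$-itemset has the same $p_X$ and the number of ordered pairs $(X, Y)$ with $Y \in I(X)$ is $\binom{n}{k}^2 - \binom{n}{k}\binom{n-k}{k}$. The sketch below does this with the standard library, summing the tail of $p_{X,i}$ through the ratio $p_{X,i+1}/p_{X,i}$. The function names and the truncation parameter `max_terms` are assumptions of this sketch; the truncation is only reasonable when the expected support $t(\gamma/n)^k$ is well below $s$ plus `max_terms`, and the code assumes $0 < (\gamma/n)^k < 1$.

```python
from math import comb

def upper_tail(t, q, s, max_terms=200):
    """Pr(Bin(t, q) >= s), summed directly from i = s via the ratio p_{i+1} / p_i."""
    term = comb(t, s) * q**s * (1.0 - q) ** (t - s)   # p_{X,s}
    total = 0.0
    for i in range(s, min(t, s + max_terms) + 1):
        total += term
        term *= (t - i) / (i + 1) * q / (1.0 - q)     # p_{X,i+1} from p_{X,i}
    return total

def b1_uniform(n, k, t, gamma, s):
    """b_1 when every item has the same probability gamma / n in each transaction."""
    q = (gamma / n) ** k                               # a fixed k-itemset is in a transaction
    p_x = upper_tail(t, q, s)
    ordered_pairs = comb(n, k) ** 2 - comb(n, k) * comb(n - k, k)
    return ordered_pairs * p_x**2
```

As an example, `b1_uniform(1000, 2, 500, 1.0, 2)` is on the order of $10^{-5}$ and drops by several orders of magnitude when `s` is raised to 3, in line with the observation that $b_1$ decreases in $s$.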

We now provide the more general theorem. 

Theorem 3. Consider an asymptotic regime where as $n \to \infty$, we have $k, s = O(1)$ with $s \ge 2$, $\mathrm{E}[R^{2s}] = 
O(n^{-a})$ for some constant $2 < a < 2s$, and $t = O(n^c)$ for some positive constant $c$. Let $U$ be a Poisson 
random variable such that $\mathrm{E}[U] = \mathrm{E}[\hat{Q}_{k,s}] = \lambda < \infty$. If 

$$c < \frac{(k-1)(a-2) + \min(2a-6,\, 0)}{2s},$$

then the variation distance between the distributions $\mathcal{L}(\hat{Q}_{k,s})$ of $\hat{Q}_{k,s}$ and $\mathcal{L}(U)$ of $U$ satisfies 

$$\|\mathcal{L}(\hat{Q}_{k,s}) - \mathcal{L}(U)\| = O(1/n).$$



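Proof. Applying Theorem 1 gives 

$$\|\mathcal{L}(\hat{Q}_{k,s}) - \mathcal{L}(U)\| \le b_1 + b_2,$$

where 

$$b_1 = \sum_{X : |X| = k} \ \sum_{Y \in I(X)} p_X p_Y$$

and 

$$b_2 = \sum_{X : |X| = k} \ \sum_{X \ne Y \in I(X)} \mathrm{E}[Z_X Z_Y].$$

We now evaluate $b_1$ and $b_2$. Letting $\vec{R}$ denote the vector of the $R_x$'s, we have that for any set $X$ of $k$ 
items 

$$\Pr(Z_X = 1 \mid \vec{R}) \le \binom{t}{s} \prod_{x \in X} R_x^{s}.$$

Since the $R_x$'s are independent with common distribution $R$, 

$$p_X = \mathrm{E}[\Pr(Z_X = 1 \mid \vec{R})] \le \binom{t}{s} \mathrm{E}[R^s]^k.$$

Using Jensen's inequality, we now have 

$$
\begin{aligned}
b_1 &= \sum_{X : |X| = k} \ \sum_{Y \in I(X)} p_X p_Y \\
&\le \left(\binom{n}{k}^2 - \binom{n}{k}\binom{n-k}{k}\right) \binom{t}{s}^2 \mathrm{E}[R^s]^{2k} \\
&\le \binom{n}{k}^2 \left(1 - \binom{n-k}{k} \Big/ \binom{n}{k}\right) \binom{t}{s}^2 \mathrm{E}[R^{2s}]^{k} \\
&= O(n^k)^2 \cdot O(1/n) \cdot O(n^{2cs}) \cdot O(n^{-ka}) \\
&= O(n^{k(2-a) + 2cs - 1}).
\end{aligned}
$$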

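We now turn our attention to $b_2$. Consider sets $X \ne Y$ of $k$ items, and suppose $g = |X \cap Y| > 0$. If 
$Z_X Z_Y = 1$, there exist disjoint subsets $A, B, C \subseteq \{1, \ldots, t\}$ such that $0 \le |A| \le s$, $|B| = |C| = s - |A|$, 
all of the transactions in $A$ contain both $X$ and $Y$, all of the transactions in $B$ contain $X$, and all of the 
transactions in $C$ contain $Y$. Therefore, 

$$
\begin{aligned}
\mathrm{E}[Z_X Z_Y \mid \vec{R}] &\le \sum_{i=0}^{s} \binom{t}{i;\, s-i;\, s-i} \left(\prod_{x \in X \cup Y} R_x^{i}\right) \left(\prod_{x \in X} R_x^{s-i}\right) \left(\prod_{y \in Y} R_y^{s-i}\right) \\
&= \sum_{i=0}^{s} \binom{t}{i;\, s-i;\, s-i} \left(\prod_{x \in X \cap Y} R_x^{2s-i}\right) \left(\prod_{x \in X - Y} R_x^{s}\right) \left(\prod_{y \in Y - X} R_y^{s}\right).
\end{aligned}
$$

Applying independence of the $R_x$'s and Jensen's inequality gives 

$$
\begin{aligned}
\mathrm{E}[Z_X Z_Y] &= \mathrm{E}[\mathrm{E}[Z_X Z_Y \mid \vec{R}]] \\
&\le \sum_{i=0}^{s} \binom{t}{i;\, s-i;\, s-i} \mathrm{E}[R^{2s-i}]^{g} \, \mathrm{E}[R^{s}]^{2(k-g)} \\
&\le \sum_{i=0}^{s} t^{2s-i} \, \mathrm{E}[R^{2s}]^{g(2s-i)/2s} \, \mathrm{E}[R^{2s}]^{k-g} \\
&\le \sum_{i=0}^{s} O(1) \, n^{(2s-i)c - a(k - ig/2s)} \\
&= O(n^{2sc - ak}) \sum_{i=0}^{s} n^{i\left(\frac{ag}{2s} - c\right)} \\
&= O\left(n^{2sc - ak + \max\{0,\, s(\frac{ag}{2s} - c)\}}\right).
\end{aligned}
$$

It follows that 

$$
\begin{aligned}
b_2 &\le \sum_{g=1}^{k-1} \binom{n}{g;\, k-g;\, k-g} \, O\left(n^{2sc - ak + \max\{0,\, s(\frac{ag}{2s} - c)\}}\right) \\
&= O(n^{2k + 2sc - ak}) \sum_{g=1}^{k-1} n^{-g} \, O\left(n^{\max\{0,\, s(\frac{ag}{2s} - c)\}}\right).
\end{aligned}
$$

Now, for $2sc/a < g < k$, we have (using the fact that $a > 2$) 

$$n^{-g} \, n^{\max\{0,\, s(\frac{ag}{2s} - c)\}} = n^{g(\frac{a}{2} - 1) - sc} \le n^{(k-1)(\frac{a}{2} - 1) - sc}.$$

Thus 

$$b_2 = O\left(n^{2k + sc - ak + (k-1)(\frac{a}{2} - 1)}\right).$$

(Here we are using the fact that our choice of $c$ satisfies $c < (k - 1)(a - 2)/2s$ to ensure that $n^{(k-1)(\frac{a}{2} - 1) - cs} = \Omega(1)$.) 

Now, we have 

$$b_1 = O(1/n)$$

since 

$$c < \frac{(k-1)(a-2)}{2s} \le \frac{k(a-2)}{2s},$$

and 

$$b_2 = O(1/n)$$

since 

$$c < \frac{k(a-2) + (a-4)}{2s}.$$

Thus 

$$b_1 + b_2 = O(1/n).$$

□ 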



It is easy to see that for fixed $k$, the quantities $b_1$ and $b_2$ defined in Theorem 1 are both decreasing in 
$s$. In the following, we will use the notation $b_1(s)$ and $b_2(s)$ to indicate explicitly that both quantities are 
functions of $s$. Therefore, for a chosen $\epsilon$, with $0 < \epsilon < 1$, we can define 

$$s_{\min} = \min\{s \ge 1 : b_1(s) + b_2(s) \le \epsilon\}. \qquad (1)$$

It immediately follows that for every $s$ in the range $[s_{\min}, \infty)$, the variation distance between the distribution 
of $\hat{Q}_{k,s}$ and the distribution of a Poisson variable with the same expectation is at most $\epsilon$. In other 
words, for every $s \ge s_{\min}$ the number of $k$-itemsets with support at least $s$ is well approximated by a Poisson 
variable. Theorems 2 and 3 proved above establish the existence of meaningful ranges of $s$ for which the 
Poisson approximation holds, under certain constraints on the individual item frequencies in the random 
dataset and on the other parameters. 



2.1 A Monte Carlo method for determining $s_{\min}$ 

While the analytical results of the previous subsection require that the individual item frequencies in the 
random dataset be drawn from a given distribution, in what follows we give experimental evidence that the 
Poisson approximation for the distribution of $\hat{Q}_{k,s}$ holds also when the item frequencies are fixed arbitrarily, 
as is the case of our reference random model. More specifically, we present a method which approximates 
the support threshold $s_{\min}$ defined by Equation (1) based on a simple Monte Carlo simulation which returns 
estimates of $b_1(s)$ and $b_2(s)$. This approach is also convenient in practice since it avoids the inevitable slack 
due to the use of asymptotics in Theorem 3. 

For a given configuration of item frequencies and number of transactions, let $\tilde{s}$ be the maximum expected 
support of any $k$-itemset in a random dataset sampled according to that configuration, that is, $t$ times the product 
of the $k$ largest item frequencies. Conceivably, the value $b_1(\tilde{s})$ is rather large, hence it makes sense to search 
for an $s_{\min}$ larger than $\tilde{s}$. We generate $\Delta$ random datasets and from each such dataset we mine all of the 
$k$-itemsets of support at least $\tilde{s}$. Let $W$ be the set of itemsets extracted in this fashion from all of the 
generated datasets. For each $s \ge \tilde{s}$ we can estimate $b_1(s)$ and $b_2(s)$ by computing for each $X \in W$ the 
empirical probability $p_X$ of the event $Z_X = 1$, and for each pair $X, Y \in W$, with $X \cap Y \ne \emptyset$, the empirical 
probability $p_{X,Y}$ of the event $(Z_X = 1) \wedge (Z_Y = 1)$. Note that for itemsets not in $W$ these probabilities 
are estimated as 0. If it turns out that $b_1(\tilde{s}) + b_2(\tilde{s}) > \epsilon/4$, then we let $\hat{s}_{\min}$ be the minimum $s > \tilde{s}$ such 
that $b_1(s) + b_2(s) \le \epsilon/4$. Otherwise, if $b_1(\tilde{s}) + b_2(\tilde{s}) \le \epsilon/4$, we repeat the above procedure starting from 
$\tilde{s}/2$. (Based on the above considerations this latter case will be unlikely.) Algorithm 1 implements the above 
ideas. 

The following theorem provides a bound on the probability that $\hat{s}_{\min}$ is a conservative estimate of $s_{\min}$, 
that is, $\hat{s}_{\min} \ge s_{\min}$. 

Theorem 4. If $\Delta = O(\log(1/\delta)/\epsilon)$, the output $\hat{s}_{\min}$ of the Monte Carlo process satisfies 

$$\Pr(b_1(\hat{s}_{\min}) + b_2(\hat{s}_{\min}) \le \epsilon) \ge 1 - \delta.$$

Proof. Let us assume $b_1(\hat{s}_{\min}) + b_2(\hat{s}_{\min}) > \epsilon$. Note that $b_1(\hat{s}_{\min}) \le b_2(\hat{s}_{\min})$, therefore we have $b_2(\hat{s}_{\min}) > \epsilon/2$. 
Let $B$ be the random variable corresponding to $\Delta$ times the estimate of $b_2(\hat{s}_{\min})$ obtained with Algorithm 
1. Thus $\mathrm{E}[B] > \Delta\epsilon/2$. Since Algorithm 1 returns $\hat{s}_{\min}$ as estimate of $s_{\min}$, we have that $B \le \Delta\epsilon/4$. Let 

$$\Delta = \frac{8 \log(1/\delta)}{\epsilon}$$

and $c < 1$ be such that: 

$$(1 - c)\,\mathrm{E}[B] = \Delta\epsilon/4.$$

Since $\mathrm{E}[B] > \Delta\epsilon/2$, we have $c > 1/2$. Using the Chernoff bound, we have that: 

$$\Pr(B \le \Delta\epsilon/4) \le e^{-\frac{c^2 \mathrm{E}[B]}{2}} \le e^{-\frac{1}{4} \cdot \frac{8 \log(1/\delta)}{2}} \le \delta.$$

Thus $\Pr(b_1(\hat{s}_{\min}) + b_2(\hat{s}_{\min}) > \epsilon) \le \delta$. □ 

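Algorithm 1 FindPoissonThreshold 

Input: Dataset $\mathcal{D}$ of $t$ transactions over $n$ items, vector $f$ of item frequencies, $k$, $\Delta$, $\epsilon$; 
Output: Estimate $\hat{s}_{\min}$ of $s_{\min}$; 
 1: $\tilde{s} \leftarrow$ highest expected support of a $k$-itemset; 
 2: $s_{\max} \leftarrow 0$; 
 3: $W \leftarrow \emptyset$; 
 4: for $i \leftarrow 1$ to $\Delta$ do 
 5:     $\hat{\mathcal{D}}_i \leftarrow$ random dataset with parameters $t, n, f$; 
 6:     $W \leftarrow W \cup \{$frequent $k$-itemsets in $\hat{\mathcal{D}}_i$ w.r.t. $\tilde{s}\}$; 
 7: if $W = \emptyset$ then 
 8:     $\tilde{s} \leftarrow \tilde{s}/2$; 
 9:     goto 3; 
10: if ($s_{\max} = 0$) then 
11:     $s_{\max} \leftarrow \max_{X \in W,\, \hat{\mathcal{D}}_i} \{$support of $X$ in $\hat{\mathcal{D}}_i\} + 1$; 
12: for $s \leftarrow \tilde{s}$ to $s_{\max}$ do 
13:     for all $X \in W$ do 
14:         $p_X(s) \leftarrow$ empirical probability of $\{Z_X = 1\}$; 
15:     for all $X, Y \in W : X \cap Y \ne \emptyset$ do 
16:         $p_{X,Y}(s) \leftarrow$ empirical probability of $\{Z_X = 1 \wedge Z_Y = 1\}$; 
17:     $b_1(s) \leftarrow \sum_{X, Y \in W;\, Y \in I(X)} p_X(s)\, p_Y(s)$; 
18:     $b_2(s) \leftarrow \sum_{X, Y \in W;\, X \ne Y \in I(X)} p_{X,Y}(s)$; 
19: if $b_1(\tilde{s}) + b_2(\tilde{s}) \le \epsilon/4$ then 
20:     $s_{\max} \leftarrow \tilde{s}$; 
21:     $\tilde{s} \leftarrow \tilde{s}/2$; 
22:     goto 3; 
23: $\hat{s}_{\min} \leftarrow \min\{s > \tilde{s} : b_1(s) + b_2(s) \le \epsilon/4\}$; 
24: return $\hat{s}_{\min}$; 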
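
The core of the Monte Carlo step, namely the empirical estimation of $b_1(s)$ and $b_2(s)$ from a batch of random datasets, can be sketched as follows. This is a simplified illustration and not the authors' implementation: it handles a single value of $s$, uses brute-force mining, omits the halving and restart logic of Algorithm 1, and all function and parameter names (`mine_supports`, `estimate_b1_b2`, `s_floor`, `delta`) are assumptions of this sketch.

```python
import random
from itertools import combinations

def mine_supports(dataset, k, s_floor):
    """Supports of all k-itemsets with support >= s_floor (brute force)."""
    counts = {}
    for tr in dataset:
        for x in combinations(sorted(tr), k):
            counts[x] = counts.get(x, 0) + 1
    return {x: c for x, c in counts.items() if c >= s_floor}

def estimate_b1_b2(freqs, t, k, s_floor, s, delta, rng):
    """Empirical b1(s), b2(s) from `delta` random datasets; requires s >= s_floor."""
    runs = []
    for _ in range(delta):
        # One random dataset: each item joins each transaction independently.
        data = [[i for i, f in freqs.items() if rng.random() < f] for _ in range(t)]
        runs.append(mine_supports(data, k, s_floor))
    W = set().union(*runs)  # itemsets seen with support >= s_floor in any run

    def hit(run, x):
        return run.get(x, 0) >= s  # indicator of the event {Z_X = 1} in one run

    p = {x: sum(hit(r, x) for r in runs) / delta for x in W}
    b1 = b2 = 0.0
    for x in W:
        for y in W:
            if set(x) & set(y):              # Y belongs to the neighborhood I(X)
                b1 += p[x] * p[y]
                if x != y:
                    b2 += sum(hit(r, x) and hit(r, y) for r in runs) / delta
    return b1, b2
```

Itemsets that never reach support `s_floor` in any run are absent from `W` and therefore contribute 0 to both estimates, matching the convention stated in the text. The double loop over `W` is quadratic and is only practical when `s_floor` is high enough to keep `W` small.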



For each dataset $\mathcal{D}$ of Table 1 and for itemset sizes $k = 2, 3, 4$, we applied Algorithm 1 setting $\Delta = 1{,}000$ 
and $\epsilon = 0.01$. The values of $\hat{s}_{\min}$ we obtained are reported in Table 2 (we added the prefix "Rand" to 
each dataset name, to denote the fact that the dataset is random and features the same parameters as the 
corresponding real one). 

                       ŝ_min
Dataset        k = 2     k = 3     k = 4
RandRetail     9237      4366      784
RandKosarak    273266    100543    20120
RandBms1       268       23        5
RandBms2       168       13        4
RandBmspos     76672     15714     2717
RandPumsb*     29303     21893     16265

Table 2: Values of $\hat{s}_{\min}$ for $\epsilon = 0.01$ and for $k = 2, 3, 4$, in random datasets with the same values of $n$, $t$, 
and with the same frequencies of the items as the corresponding benchmark datasets. 

3 Procedures for the discovery of high-support significant itemsets

For a given itemset size $k$, the value $s_{\min}$ identifies a region of (relatively high) supports where we concentrate
our quest for statistically significant $k$-itemsets. In this section we develop procedures to identify a family
of $k$-itemsets (among those of support greater than or equal to $s_{\min}$) which are statistically significant with
a controlled FDR. More specifically, in Subsection 3.1 we show that a family with the desired properties
can be obtained as a subset of the frequent $k$-itemsets with respect to $s_{\min}$, selected based on a standard
multi-comparison test. However, the returned family may turn out to be too small (i.e., the procedure
yields a large number of false negatives). To achieve higher effectiveness, in Subsection 3.2 we devise a more
sophisticated procedure which identifies a support threshold $s^* \ge s_{\min}$ such that all frequent $k$-itemsets
with respect to $s^*$ are statistically significant with a controlled FDR. In the next section we will provide
experimental evidence that in many cases the latter procedure yields far fewer false negatives.

3.1 A procedure based on a standard multi-comparison test 

We present a first, simple procedure to discover significant itemsets with controlled FDR, based on the 
following well established result in multi-comparison testing. 

Theorem 5 ([5]). Assume that we are testing for $m$ null hypotheses. Let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$ be the
ordered observed p-values of the $m$ tests. For a given parameter $\beta$, with $0 < \beta < 1$, define

$$\ell = \max\left\{0,\; i : p_{(i)} \le \frac{i}{m \sum_{j=1}^{m} \frac{1}{j}}\,\beta \right\}, \qquad (2)$$

and reject the null hypotheses corresponding to tests $(1), \ldots, (\ell)$. Then, the FDR for the set of rejected null
hypotheses is upper bounded by $\beta$.
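
As an illustration of the step-up rule in Theorem 5, the short Python sketch below computes the index $\ell$ from a list of p-values. The function name `by_rejection_count` is hypothetical, and the list may contain fewer than $m$ values, on the assumption that any p-value not supplied is too large to be rejected.

```python
def by_rejection_count(pvalues, m, beta):
    """Largest i with p_(i) <= i * beta / (m * H_m), where H_m = sum_{j=1}^m 1/j; 0 if none."""
    harmonic = sum(1.0 / j for j in range(1, m + 1))
    ell = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= i * beta / (m * harmonic):
            ell = i  # keep the largest index that satisfies the condition
    return ell
```

Note that the rule is a step-up rule: the hypotheses $(1), \ldots, (\ell)$ are all rejected even if some smaller index fails its own comparison, which is why the loop records the largest satisfying index rather than stopping at the first failure.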

Let $\mathcal{D}$ denote an input dataset consisting of $t$ transactions over $n$ items, and let $k$ be the fixed itemset size.
Recall that $s_{\min}$ is the minimum support threshold for which the distribution of $\hat{Q}_{k,s}$ is well approximated
by a Poisson distribution. First, we mine from $\mathcal{D}$ the set of frequent $k$-itemsets $\mathcal{F}_{(k)}(s_{\min})$. Then, for each
$X \in \mathcal{F}_{(k)}(s_{\min})$, we test the null hypothesis that the observed support of $X$ in $\mathcal{D}$ is drawn from a
Binomial distribution with parameters $t$ and $f_X$ (the product of the individual frequencies of the items of
$X$), setting the rejection threshold as specified by condition (2), with parameters $\beta$ and $m = \binom{n}{k}$. Based
on Theorem 5, the itemsets of $\mathcal{F}_{(k)}(s_{\min})$ whose associated null hypothesis is rejected can be returned as
significant, with FDR upper bounded by $\beta$. The pseudocode Procedure 1 implements the strategy described
above.

Procedure 1

Input: Dataset $\mathcal{D}$ of $t$ transactions over $n$ items, vector $f$ of item frequencies, $k$, $\beta \in (0, 1)$;
Output: Family of significant $k$-itemsets with FDR $\le \beta$;

Determine $s_{\min}$ and compute $\mathcal{F}_{(k)}(s_{\min})$ from $\mathcal{D}$;
for all $X \in \mathcal{F}_{(k)}(s_{\min})$ do
    $s_X \leftarrow$ support of $X$ in $\mathcal{D}$;
    $f_X \leftarrow \prod_{i \in X} f_i$;
    $p^{(X)} \leftarrow \Pr(\mathrm{Bin}(t, f_X) \ge s_X)$;
Let $p_{(1)}, p_{(2)}, \ldots$ be the sorted sequence of the values $p^{(X)}$, with $X \in \mathcal{F}_{(k)}(s_{\min})$;
$m \leftarrow \binom{n}{k}$;
$\ell = \max\{0,\; i : p_{(i)} \le \frac{i}{m \sum_{j=1}^{m} 1/j}\,\beta\}$;
return $\{X \in \mathcal{F}_{(k)}(s_{\min}) : p^{(X)} = p_{(i)},\ 1 \le i \le \ell\}$;
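
The p-value $p^{(X)}$ used in Procedure 1 is an upper tail of a Binomial distribution. A minimal Python sketch of that computation follows; the function name `itemset_pvalue` is hypothetical, and the direct summation is an assumption suitable only for small $t$, since the exact binomial coefficients overflow floating point for large datasets.

```python
import math


def itemset_pvalue(t, item_freqs, support):
    """Pr(Bin(t, f_X) >= support), with f_X the product of the frequencies of the items of X."""
    f_x = math.prod(item_freqs)
    # Probability of observing fewer than `support` occurrences in t independent transactions.
    below = sum(math.comb(t, j) * f_x**j * (1.0 - f_x)**(t - j)
                for j in range(support))
    return max(0.0, 1.0 - below)  # clamp tiny negative values caused by rounding
```

For datasets of realistic size one would more likely rely on a library survival function, for example `scipy.stats.binom.sf(support - 1, t, f_x)`, which returns the same tail probability.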



3.2 Establishing a support threshold for significant frequent itemsets 

Let $\alpha$ and $\beta$ be two constants in $(0, 1)$. We seek a threshold $s^*$ such that, with confidence $1 - \alpha$, the $k$-itemsets
in $\mathcal{F}_{(k)}(s^*)$ can be flagged as statistically significant with FDR at most $\beta$. The threshold $s^*$ is determined
through a robust statistical approach which ensures that the number $Q_{k,s^*} = |\mathcal{F}_{(k)}(s^*)|$ deviates significantly
from what would be expected in a random dataset, and that the magnitude of the deviation is sufficient to
guarantee the bound on the FDR.

Let $s_{\min}$ be the minimum support such that the Poisson approximation for the distribution of $\hat{Q}_{k,s}$ holds
for $s \ge s_{\min}$, and let $s_{\max}$ be the maximum support of an item (hence, of an itemset) in $\mathcal{D}$. Our procedure
performs $h = \lfloor \log_2(s_{\max} - s_{\min}) \rfloor + 1$ comparisons. Let $s_0 = s_{\min}$ and $s_i = s_{\min} + 2^i$, for $1 \le i < h$. In the
$i$-th comparison, with $0 \le i < h$, we test the null hypothesis $H_0^i$ that the observed value $Q_{k,s_i}$ is drawn from
the same Poisson distribution as $\hat{Q}_{k,s_i}$. We choose as $s^*$ the minimum of the $s_i$'s, if any, for which the null
hypothesis $H_0^i$ is rejected.
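
The sequence of candidate thresholds can be sketched in a few lines of Python. The function name `support_grid` is hypothetical, and the sketch assumes integer supports with $s_{\max} > s_{\min}$.

```python
def support_grid(s_min, s_max):
    """Return [s_0, ..., s_{h-1}] with s_0 = s_min and s_i = s_min + 2**i for 1 <= i < h."""
    gap = s_max - s_min
    if gap < 1:
        raise ValueError("s_max must exceed s_min")
    h = gap.bit_length()  # equals floor(log2(gap)) + 1 for a positive integer gap
    return [s_min] + [s_min + 2**i for i in range(1, h)]


# s_min = 268 is the value reported for RandBms1 with k = 2 in Table 2;
# s_max = 368 is a made-up value used only to keep the example small.
print(support_grid(268, 368))  # [268, 270, 272, 276, 284, 300, 332]
```

Because the candidates grow geometrically, the number of comparisons is logarithmic in $s_{\max} - s_{\min}$, which keeps the confidence budget $\alpha$ from being split over too many tests.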

For the correctness of the above procedure, it is crucial to specify a suitable rejection condition for each
$H_0^i$. Assume first that, for $0 \le i < h$, we reject the null hypothesis $H_0^i$ when the p-value of the observed value
$Q_{k,s_i}$ is smaller than $\alpha_i$, where the $\alpha_i$'s are chosen so that $\sum_{i=0}^{h-1} \alpha_i = \alpha$. Then, the union bound shows that
the probability of rejecting any true null hypothesis is less than $\alpha$. However, this approach does not yield
a bound on the FDR for the set $\mathcal{F}_{(k)}(s^*)$. In fact, some itemsets in $\mathcal{F}_{(k)}(s^*)$ are likely to occur with high
support even under $H_0^i$, hence they would represent false discoveries. The impact of this phenomenon can
be contained by ensuring that the FDR is below a specified level $\beta$. To this purpose, we must strengthen
the rejection condition, as explained below.

Fix suitable values $\beta_0, \beta_1, \ldots, \beta_{h-1}$ such that $\sum_{i=0}^{h-1} \beta_i^{-1} = \beta$. For $0 \le i < h$, let $\lambda_i = E[\hat{Q}_{k,s_i}]$. We now
reject $H_0^i$ when the p-value of $Q_{k,s_i}$ is smaller than $\alpha_i$, and $Q_{k,s_i} \ge \beta_i \lambda_i$. The following theorem establishes
the correctness of this approach.

Theorem 6. With confidence $1 - \alpha$, $\mathcal{F}_{(k)}(s^*)$ is a family of statistically significant frequent $k$-itemsets with
FDR at most $\beta$.

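Proof. Observe that since $\sum_{i=0}^{h-1} \alpha_i = \alpha$, we have that all rejections are correct, with probability at least
$1 - \alpha$. Let $E_i$ be the event "$H_0^i$ is rejected" or equivalently, "the p-value of $Q_{k,s_i}$ is smaller than $\alpha_i$ and
$Q_{k,s_i} \ge \beta_i \lambda_i$". Suppose that $H_0^i$ is the first rejected null hypothesis, for some index $i$, whence $s^* = s_i$. In this
case, $Q_{k,s_i}$ itemsets are flagged as significant. We denote by $V_i$ the number of false discoveries among these
$Q_{k,s_i}$ itemsets. It is easy to argue that the expectation of $V_i$ is upper bounded by $E[X_i \mid E_i, \bar{E}_{i-1}, \ldots, \bar{E}_0]$,
where $X_i$ is a Poisson variable with expectation $\lambda_i$. Since $Q_{k,s_i} \ge \beta_i \lambda_i$ when $H_0^i$ is rejected, by the law of
total probability we have

$$
\begin{aligned}
\mathrm{FDR} &\le \sum_{i=0}^{h-1} E\left[\frac{V_i}{Q_{k,s_i}} \,\middle|\, E_i, \bar{E}_{i-1}, \ldots, \bar{E}_0\right] \Pr(E_i, \bar{E}_{i-1}, \ldots, \bar{E}_0) \\
&\le \sum_{i=0}^{h-1} \frac{E[X_i \mid E_i, \bar{E}_{i-1}, \ldots, \bar{E}_0]}{\beta_i \lambda_i} \Pr(E_i, \bar{E}_{i-1}, \ldots, \bar{E}_0) \\
&\le \sum_{i=0}^{h-1} \frac{E[X_i]}{\beta_i \lambda_i} \\
&\le \sum_{i=0}^{h-1} \beta_i^{-1} = \beta.
\end{aligned}
$$

□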



The pseudocode Procedure 2 specifies more formally our approach to determine the support threshold $s^*$.
Note that estimates for the $\lambda_i$'s needed in the for-loop of Lines 7-8 can be obtained from the same random
datasets generated in Algorithm 1, which are used there for the estimation of $s_{\min}$.



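Procedure 2

Input: Dataset $\mathcal{D}$ of $t$ transactions over $n$ items, vector $f$ of item frequencies, $k$, $\alpha, \beta \in (0, 1)$;
Output: $s^*$ such that, with confidence $1 - \alpha$, $\mathcal{F}_{(k)}(s^*)$ is a family of significant $k$-itemsets with FDR $\le \beta$;
 1: Determine $s_{\min}$ and compute $\mathcal{F}_{(k)}(s_{\min})$ from $\mathcal{D}$;
 2: $s_{\max} \leftarrow$ maximum support of an item;
 3: $i \leftarrow 0$; $s_0 \leftarrow s_{\min}$;
 4: $h \leftarrow \lfloor \log_2(s_{\max} - s_{\min}) \rfloor + 1$;
 5: Fix $\alpha_0, \ldots, \alpha_{h-1} \in (0, 1)$ s.t. $\sum_{i=0}^{h-1} \alpha_i = \alpha$;
 6: Fix $\beta_0, \ldots, \beta_{h-1}$ s.t. $\sum_{i=0}^{h-1} \beta_i^{-1} = \beta$;
 7: for $i \leftarrow 0$ to $h - 1$ do
 8:     Compute $\lambda_i = E[\hat{Q}_{k,s_i}]$;
 9: while $i < h$ do
10:     Compute $Q_{k,s_i}$;
11:     if ($\Pr(\mathrm{Poisson}(\lambda_i) \ge Q_{k,s_i}) < \alpha_i$) and ($Q_{k,s_i} \ge \beta_i \lambda_i$) then
12:         return $s^* \leftarrow s_i$;
13:     $s_{i+1} \leftarrow s_{\min} + 2^{i+1}$;
14:     $i \leftarrow i + 1$;
15: return $s^* \leftarrow \infty$;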
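
The rejection test at the core of Procedure 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the candidate supports, the observed counts $Q_{k,s_i}$ and the expectations $\lambda_i$ are supplied by the caller (the procedure itself obtains them by mining the dataset and from random datasets), the names `poisson_tail` and `find_threshold` are hypothetical, and the equal split $\alpha_i = \alpha/h$, $\beta_i^{-1} = \beta/h$ is the choice reported in Section 4.1 rather than a requirement.

```python
import math


def poisson_tail(lam, q):
    """Pr(Poisson(lam) >= q), computed as one minus the lower cumulative sum."""
    term, below = math.exp(-lam), 0.0
    for j in range(q):
        below += term          # add Pr(Poisson(lam) = j)
        term *= lam / (j + 1)  # move on to Pr(Poisson(lam) = j + 1)
    return max(0.0, 1.0 - below)


def find_threshold(supports, observed, expected, alpha, beta):
    """Return the first s_i whose null hypothesis is rejected, or math.inf if none is."""
    h = len(supports)
    alpha_i = alpha / h  # the alpha_i's sum to alpha
    beta_i = h / beta    # the inverses 1/beta_i sum to beta
    for s, q, lam in zip(supports, observed, expected):
        if poisson_tail(lam, q) < alpha_i and q >= beta_i * lam:
            return s
    return math.inf
```

With three candidates and $\alpha = \beta = 0.05$, each $\beta_i$ equals 60, so a count of 5 against $\lambda_i = 0.2$ has a tiny p-value but still fails the second condition ($5 < 12$); this is the strengthening that controls the FDR.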



4 Experimental Results 

In order to show the potential of our approach, in this section we report on a number of experiments
performed on the benchmark datasets of Table 1. First, in Subsection 4.1, we validate experimentally the
methodology implemented by Procedure 2, while in Subsection 4.2 we compare Procedure 2 against the
more standard Procedure 1 with respect to their ability to discover significant itemsets.

4.1 Experiments on benchmark datasets

For each benchmark dataset in Table 1 and for $k = 2, 3, 4$, we apply Procedure 2 with $\alpha = \beta = 0.05$, and
$\alpha_i = \beta_i^{-1} = 0.05/h$. The results are displayed in Table 3 where, for each dataset and for each value of $k$,
we show: the support $s^*$ returned by Procedure 2, the number $Q_{k,s^*}$ of $k$-itemsets with support at least $s^*$,
and the expected number $\lambda(s^*)$ of itemsets with support at least $s^*$ in a corresponding random dataset.







                     k = 2                          k = 3                          k = 4
Dataset      s*     Q_{k,s*}   λ(s*)        s*     Q_{k,s*}   λ(s*)        s*     Q_{k,s*}   λ(s*)
Retail        ∞          0       0           ∞          0       0         848          6     0.01
Kosarak       ∞          0       0           ∞          0       0       21144         12     0.01
Bms1        276         56     0.19         23     258859     0.06          5        27M     0.05
Bms2        168        429     0.73         13      36112     0.25          4     714045     0.01
Bmspos        ∞          0       0       16226         22     0.01       2717        891     0.38
Pumsb*    29303         29     0.05      21893        406     0.35      16265       6293     1.37

Table 3: Results obtained by applying Procedure 2 with $\alpha = 0.05$, $\beta = 0.05$ and $k = 2, 3, 4$ to the benchmark
datasets of Table 1.

We observe that for most pairs (dataset, $k$) the number of significant frequent $k$-itemsets obtained is rather
small, but, in fact, at support $s^*$ in random instances of those datasets, less than two (often much less than
one) frequent $k$-itemsets would be expected. These results provide evidence that our methodology not only
defines significance on statistically rigorous grounds, but also provides the mining task with suitable support
thresholds that avoid explosion of the output size (the widely recognized "Achilles' heel" of traditional
frequent itemset mining). This feature crucially relies on the identification of a region of "rare events"
provided by the Poisson approximation. As discussed in Section 1.3, the discovery of significant itemsets
with low support (not returned by our method) would require the extraction of a large (possibly exponential)
number of itemsets, which would make any strategy aiming to discover these itemsets infeasible. Instead, we
provide an efficient method to identify, with a high confidence level, the family of most frequent itemsets that
are statistically significant, without overwhelming the user with a huge number of discoveries.

There are, however, a few cases where the number of itemsets returned is still considerably high. Their
large number may serve as a sign that the results call for further analysis, possibly using clustering techniques
[26] or limiting the search to closed itemsets [19]³. For example, consider dataset Bms1 with $k = 4$ and the
corresponding value $s^* = 5$ from Table 3. Extracting the closed itemsets of support greater than or equal to $s^*$
in that dataset revealed the presence of a closed itemset of cardinality 154 with support greater than 7 in
the dataset. This itemset, whose occurrence by itself represents an extremely unlikely event in a random
dataset, accounts for more than 22M non-closed subsets with the same support among the 27M reported as
significant.
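
The order of magnitude quoted above is consistent with a simple count: every 4-item subset of an itemset of cardinality 154 is itself a 4-itemset with at least the same support. The following Python check is only a plausibility computation under that reading; the source does not spell out how the figure was obtained.

```python
import math

# Number of distinct 4-item subsets of a single itemset of cardinality 154.
subsets_of_size_4 = math.comb(154, 4)
print(subsets_of_size_4)  # 22533126
```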

It is interesting to observe that the results obtained for dataset Retail provide further evidence for the
conclusions drawn in [10], which suggested random behavior for this dataset (although the random model
in that work is slightly different from ours, in that the family of random datasets also maintains the same
transaction lengths as the real one). Indeed, no support threshold $s^*$ could be established for mining
significant $k$-itemsets with $k = 2, 3$, while the support threshold $s^*$ identified for $k = 4$ yielded as few as 6
itemsets. However, the conclusion drawn in [10] was based on a qualitative assessment of the discrepancy
between the numbers of frequent itemsets in the random and real datasets, while our methodology confirms
the findings on a statistically sound and rigorous basis.

Observe also that for some other pairs (dataset, $k$) our procedure does not identify any support threshold
useful for mining statistically significant itemsets. This is evidence that, for the specific $k$ and for the
high supports considered by our approach, these datasets do not present a significant deviation from the
corresponding random datasets.

³ An itemset is closed if it is not properly contained in another itemset with the same support.

Finally, in order to assess its robustness, we applied our methodology to random datasets. Specifically,
for each benchmark dataset of Table 1 and for $k = 2, 3, 4$, we generated 100 random instances with the same
parameters as those of the benchmark, and applied Procedure 2 to each instance, searching for a support
threshold $s^*$ for mining significant itemsets. In Table 4 we report the number of times Procedure 2 was
successful in returning a finite value for $s^*$. As expected, the procedure returned $s^* = \infty$ in all cases but
for 2 of the 100 instances of the random dataset with the same parameters as dataset Pumsb* with $k = 2$.
However, in these two latter cases, mining at the identified support threshold only yielded a very small
number of significant itemsets (one and two, respectively).



                       s* < ∞
Dataset            k = 2   k = 3   k = 4
RandomRetail         0       0       0
RandomKosarak        0       0       0
RandomBms1           0       0       0
RandomBms2           0       0       0
RandomBmspos         0       0       0
RandomPumsb*         2       0       0

Table 4: Results for Procedure 2 with $\alpha = 0.05$, $\beta = 0.05$ for random versions of benchmark datasets; each
entry reports the number of times, out of 100 trials, the procedure returned a finite value for $s^*$.

4.2 Relative effectiveness of Procedures 1 and 2

In order to assess the relative effectiveness of the two procedures presented in the previous section, we applied
them to the benchmark datasets of Table 1. Specifically, we compared the number of itemsets extracted
using the threshold $s^*$ provided by Procedure 2 with the number of itemsets flagged as significant using
the more standard method based on Benjamini and Yekutieli's technique (Procedure 1), imposing the same
upper bound $\beta = 0.05$ on the FDR.

The results are displayed in Table 5 where, for each pair (dataset, $k$), we report the cardinality of the
family $\mathcal{R}$ of $k$-itemsets flagged as significant by Procedure 1 and the ratio $r = Q_{k,s^*}/|\mathcal{R}|$, where $Q_{k,s^*}$ is
the number of $k$-itemsets of support at least $s^*$, which are returned as significant with the methodology of
Subsection 3.2.
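
As a quick arithmetic check of how the ratio is formed, the Python snippet below recomputes $r$ for a few pairs (dataset, $k$) from the counts $Q_{k,s^*}$ of Table 3 and the cardinalities $|\mathcal{R}|$ of Table 5. The helper name `effectiveness_ratio` and the selection of rows are choices made here for illustration.

```python
def effectiveness_ratio(q_at_s_star, flagged_by_procedure_1):
    """r = Q_{k,s*} / |R|: itemsets returned via s* relative to those flagged by Procedure 1."""
    return q_at_s_star / flagged_by_procedure_1


# (dataset, k) -> (Q_{k,s*} from Table 3, |R| from Table 5)
rows = {
    ("Bms1", 2): (56, 60),
    ("Bms2", 3): (36112, 25906),
    ("Bmspos", 3): (22, 23),
    ("Pumsb*", 4): (6293, 6288),
}
ratios = {key: round(effectiveness_ratio(q, r), 3) for key, (q, r) in rows.items()}
print(ratios)
```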

We observe that in all cases where Procedure 2 returned a finite value of $s^*$ the ratio $r$ is greater than
or equal to 1 (except for dataset Bms1 and $k = 2$, and dataset Bmspos and $k = 3$, where $r$ is however very
close to 1). Moreover, in some cases the ratio $r$ is rather large. Since both methodologies identify significant
$k$-itemsets among all those of support at least $s_{\min}$, these results provide evidence that the methodology
of Subsection 3.2 is often more (sometimes much more) effective. The methodology succeeds in identifying
more significant itemsets, since it evaluates the significance of the entire set $\mathcal{F}_{(k)}(s^*)$ by comparing $Q_{k,s^*}$
to $\hat{Q}_{k,s^*}$. In contrast, Procedure 1 must implicitly test considerably more hypotheses (corresponding to the
significance of all possible $k$-itemsets), thus the power of the test ($1 - \Pr(\text{Type-II error})$) is significantly smaller.

Observe that the cases where $r = 0$ in Table 5 correspond to pairs (dataset, $k$) for which Procedure 2
returned $s^* = \infty$, that is, the procedure was not able to identify a threshold for mining significant $k$-itemsets.
Note, however, that in all of these cases the number of significant $k$-itemsets returned by Procedure 1 is
extremely small (between 1 and 3). Hence, for these pairs, both methodologies indicate that there is very
little significant information to be mined at high supports.

                 k = 2               k = 3                k = 4
Dataset       |R|      r          |R|       r          |R|        r
Retail          3      0            3       0            6      1.0
Kosarak         1      0            1       0           12      1.0
Bms1           60   0.933       64367   4.441       219706    122.9
Bms2          429     1.0       25906   1.394        60927    11.72
Bmspos          2      0           23   0.957          891      1.0
Pumsb*         29     1.0         406     1.0         6288    1.001

Table 5: Results using Procedure 1 to bound the FDR with $\beta = 0.05$ for itemsets of support $\ge s_{\min}$.

5 Conclusions 

The main technical contribution of this work is the proof that in a random dataset where items are placed
independently in transactions, there is a minimum support $s_{\min}$ such that the number of $k$-itemsets with
support at least $s_{\min}$ is well approximated by a Poisson distribution. The expectation of the Poisson
distribution and the threshold $s_{\min}$ are functions of the number of transactions, number of items, and frequencies
of individual items.

This result is at the base of a novel methodology for mining frequent itemsets which can be flagged
as statistically significant incurring a small FDR. In particular, we use the Poisson distribution as the
distribution of the null hypothesis in a novel multi-hypothesis statistical approach for identifying a suitable
support threshold $s^* \ge s_{\min}$ for the mining task. We control the FDR of the output in a way which takes
into account global characteristics of the dataset, hence it turns out to be more powerful than other standard
statistical tools (e.g., [5]). The results of a number of experiments, reported in the paper, provide evidence
of the effectiveness of our approach.

To the best of our knowledge, our methodology represents the first attempt at establishing a support 
threshold for the classical frequent itemset mining problem with a quantitative guarantee on the significance 
of the output. 